Marketing Campaign Analysis¶

Problem Definition¶

The Context:¶

  • Segmentation is a valuable tool for understanding the current market and customers.By breaking down a diverse market into distinct segments based on relevant characteristics, such as demographics, behavior, or preferences, businesses can gain deeper insights into their customers' needs, motivations, and preferences. This enables targeted marketing strategies, personalized product offerings, and effective communication approaches. By understanding the unique characteristics and preferences of different customer segments, businesses can tailor their efforts to maximize customer satisfaction, loyalty, and ultimately drive business growth.

The objective:¶

  • Identify different segments in the existing customer base of the marketing campaign dataset, taking into account their spending patterns as well as past interactions with the company.

The key questions:¶

  • What are the different customer segments and how much do they spend on average?
  • What are the key features that describe each segment?
  • Which segments should we focus on? Are there any we should avoid?

The problem formulation:¶

  • Clean Data
  • Perform EDA
  • Perform PCA on the original variables to obtain a set of linearly independent variables. Then, utilize these transformed variables for the purpose of clustering
  • Build various clustering algorithms: K-Means, K-Medoids, Hierarchical, DBSCAN, and Gaussian Mixture Model
  • Evaluate cluesting algorithms using Silhouette score and value of insights.

Data Dictionary¶


The dataset contains the following features:

  1. ID: Unique ID of each customer
  2. Year_Birth: Customer’s year of birth
  3. Education: Customer's level of education
  4. Marital_Status: Customer's marital status
  5. Kidhome: Number of small children in customer's household
  6. Teenhome: Number of teenagers in customer's household
  7. Income: Customer's yearly household income in USD
  8. Recency: Number of days since the last purchase
  9. Dt_Customer: Date of customer's enrollment with the company
  10. MntFishProducts: The amount spent on fish products in the last 2 years
  11. MntMeatProducts: The amount spent on meat products in the last 2 years
  12. MntFruits: The amount spent on fruits products in the last 2 years
  13. MntSweetProducts: Amount spent on sweet products in the last 2 years
  14. MntWines: The amount spent on wine products in the last 2 years
  15. MntGoldProds: The amount spent on gold products in the last 2 years
  16. NumDealsPurchases: Number of purchases made with discount
  17. NumCatalogPurchases: Number of purchases made using a catalog (buying goods to be shipped through the mail)
  18. NumStorePurchases: Number of purchases made directly in stores
  19. NumWebPurchases: Number of purchases made through the company's website
  20. NumWebVisitsMonth: Number of visits to the company's website in the last month
  21. AcceptedCmp1: 1 if customer accepted the offer in the first campaign, 0 otherwise
  22. AcceptedCmp2: 1 if customer accepted the offer in the second campaign, 0 otherwise
  23. AcceptedCmp3: 1 if customer accepted the offer in the third campaign, 0 otherwise
  24. AcceptedCmp4: 1 if customer accepted the offer in the fourth campaign, 0 otherwise
  25. AcceptedCmp5: 1 if customer accepted the offer in the fifth campaign, 0 otherwise
  26. Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
  27. Complain: 1 If the customer complained in the last 2 years, 0 otherwise

Note: You can assume that the data is collected in the year 2016.

Import the necessary libraries and load the data¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.linear_model import LinearRegression
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
from scipy.spatial.distance import pdist
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score




%matplotlib inline 
import warnings
warnings.filterwarnings("ignore")
In [2]:
# Function to create barplots that indicates percentage for each category
def bar_perc(data, column):
    
    total = len(data[column]) # Length of the column
    plt.figure(figsize = (15, 5))
    
    # plt.xticks(rotation = 45) 
    ax = sns.countplot(data[column], palette = 'Paired')
    
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total) # Percentage of each class
        x = p.get_x() + p.get_width() / 2 - 0.05                    # Width of the plot
        y = p.get_y() + p.get_height()                              # Height of the plot
        ax.annotate(percentage, (x, y), size = 12)                  # Annotate the percentage 
        
    plt.show()   
In [3]:
# This function takes the numerical variable as the input and returns the boxplots and histograms for that variable
# This would help us write faster and cleaner code

def histogram_boxplot(feature, figsize = (15,10), bins = None):
    """ Boxplot and histogram combined
    feature: 1-d feature array
    figsize: size of fig (default (9,8))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(nrows = 2,     # Number of rows of the subplot grid
                                           sharex = True, # The X-axis will be shared among all the subplots
                                           gridspec_kw = {"height_ratios": (.25, .75)}, 
                                           figsize = figsize 
                                           ) 
    # Creating the subplots
    # Boxplot will be created and the mean value of the column will be indicated using some symbol
    sns.boxplot(feature, ax = ax_box2, showmeans = True, color ='red')
    
    # For histogram
    sns.histplot(feature, kde = True, ax = ax_hist2, bins = bins) if bins else sns.histplot(feature, kde = True, ax = ax_hist2)
    ax_hist2.axvline(np.mean(feature), color = 'g', linestyle = '--')      # Add mean to the histogram
    ax_hist2.axvline(np.median(feature), color = 'black', linestyle = '-') # Add median to the histogram
    
    # Calculate the boundaries of outliers
    iqr = np.percentile(feature, 75) - np.percentile(feature, 25)
    lower_bound = np.percentile(feature, 25) - 1.5 * iqr
    upper_bound = np.percentile(feature, 75) + 1.5 * iqr
    
    # Get the minimum and maximum values (boundaries of outliers)
    minimum = np.min(feature[feature >= lower_bound])
    maximum = np.max(feature[feature <= upper_bound])
    
    print('lower_bound: ', minimum)
    print('upper_bound', maximum)

Data Overview¶

  • Reading the dataset
  • Understanding the shape of the dataset
  • Checking the data types
  • Checking for missing values
  • Checking for duplicated values
  • Drop the column which has no null values
In [4]:
#__________________
# read the datafile

pd.set_option('display.max_columns', None)


df_initial = pd.read_csv('marketing_campaign+%284%29.csv')
print('Dataframe dimensions:', df_initial.shape)
if df_initial.duplicated().any():
    print('hasDuplicates:', df_initial.duplicated().sum())
else:
    print('hasDuplicates:', False)
#____________________________________________________________
# gives some infos on columns types and numer of null values
tab_info = pd.concat([pd.DataFrame(df_initial.dtypes).T.rename(index={0:'column type'}), 
                      pd.DataFrame(df_initial.isnull().sum()).T.rename(index={0:'null values (nb)'}), 
                      pd.DataFrame(df_initial.isnull().sum()/df_initial.shape[0]*100).T.rename(index={0:'null values (%)'})])

display(tab_info)

#__________________
# show first lines
display(df_initial[:5])
Dataframe dimensions: (2240, 27)
hasDuplicates: False
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
column type int64 int64 object object float64 int64 int64 object int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 int64
null values (nb) 0 0 0 0 24 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
null values (%) 0.0 0.0 0.0 0.0 1.071429 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
0 5524 1957 Graduation Single 58138.0 0 0 04-09-2012 58 635 88 546 172 88 88 3 8 10 4 7 0 0 0 0 0 0 1
1 2174 1954 Graduation Single 46344.0 1 1 08-03-2014 38 11 1 6 2 1 6 2 1 1 2 5 0 0 0 0 0 0 0
2 4141 1965 Graduation Together 71613.0 0 0 21-08-2013 26 426 49 127 111 21 42 1 8 2 10 4 0 0 0 0 0 0 0
3 6182 1984 Graduation Together 26646.0 1 0 10-02-2014 26 11 4 20 10 3 5 2 2 0 4 6 0 0 0 0 0 0 0
4 5324 1981 PhD Married 58293.0 1 0 19-01-2014 94 173 43 118 46 27 15 5 5 3 6 5 0 0 0 0 0 0 0
In [5]:
df_initial.describe()
Out[5]:
ID Year_Birth Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
count 2240.000000 2240.000000 2216.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000 2240.000000
mean 5592.159821 1968.805804 52247.251354 0.444196 0.506250 49.109375 303.935714 26.302232 166.950000 37.525446 27.062946 44.021875 2.325000 4.084821 2.662054 5.790179 5.316518 0.072768 0.074554 0.072768 0.064286 0.012946 0.009375 0.149107
std 3246.662198 11.984069 25173.076661 0.538398 0.544538 28.962453 336.597393 39.773434 225.715373 54.628979 41.280498 52.167439 1.932238 2.778714 2.923101 3.250958 2.426645 0.259813 0.262728 0.259813 0.245316 0.113069 0.096391 0.356274
min 0.000000 1893.000000 1730.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2828.250000 1959.000000 35303.000000 0.000000 0.000000 24.000000 23.750000 1.000000 16.000000 3.000000 1.000000 9.000000 1.000000 2.000000 0.000000 3.000000 3.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 5458.500000 1970.000000 51381.500000 0.000000 0.000000 49.000000 173.500000 8.000000 67.000000 12.000000 8.000000 24.000000 2.000000 4.000000 2.000000 5.000000 6.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 8427.750000 1977.000000 68522.000000 1.000000 1.000000 74.000000 504.250000 33.000000 232.000000 50.000000 33.000000 56.000000 3.000000 6.000000 4.000000 8.000000 7.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 11191.000000 1996.000000 666666.000000 2.000000 2.000000 99.000000 1493.000000 199.000000 1725.000000 259.000000 263.000000 362.000000 15.000000 27.000000 28.000000 13.000000 20.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

Observations and Insights from the Data overview: _¶

  1. The data frame has 2,240 columns and 27 columns
  2. Income has 1% null values. Need to decide if will impute or remove.
  3. No duplicates found.
  4. Need to convert Dt_Customer to datetime
In [6]:
# To remove the NA values
df = df_initial.dropna()

# Number of columns that contain null values in df_initial
num_null_columns = df_initial.isnull().any().sum()


num_rows_dropped = len(df_initial) - len(df)
num_remaining_rows = len(df)

# Determine pluralization for columns
column_plural = "column" if num_null_columns == 1 else "columns"

# Print the results
print(f'Dropped {num_rows_dropped} rows from {num_null_columns} {column_plural}. The total remaining rows are {num_remaining_rows:,}.')
Dropped 24 rows from 1 column. The total remaining rows are 2,216.
In [7]:
# Convert Dt_Customer column to datetime
df_initial['Dt_Customer'] = pd.to_datetime(df_initial['Dt_Customer'], format='%d-%m-%Y', errors='coerce')
In [8]:
df_initial[df_initial.isnull().any(axis=1)]
Out[8]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
10 1994 1983 Graduation Married NaN 1 0 2013-11-15 11 5 5 6 0 2 1 1 1 0 2 7 0 0 0 0 0 0 0
27 5255 1986 Graduation Single NaN 1 0 2013-02-20 19 5 1 3 3 263 362 0 27 0 0 1 0 0 0 0 0 0 0
43 7281 1959 PhD Single NaN 0 0 2013-11-05 80 81 11 50 3 2 39 1 1 3 4 2 0 0 0 0 0 0 0
48 7244 1951 Graduation Single NaN 2 1 2014-01-01 96 48 5 48 6 10 7 3 2 1 4 6 0 0 0 0 0 0 0
58 8557 1982 Graduation Single NaN 1 0 2013-06-17 57 11 3 22 2 2 6 2 2 0 3 6 0 0 0 0 0 0 0
71 10629 1973 2n Cycle Married NaN 1 0 2012-09-14 25 25 3 43 17 4 17 3 3 0 3 8 0 0 0 0 0 0 0
90 8996 1957 PhD Married NaN 2 1 2012-11-19 4 230 42 192 49 37 53 12 7 2 8 9 0 0 0 0 0 0 0
91 9235 1957 Graduation Single NaN 1 1 2014-05-27 45 7 0 8 2 0 1 1 1 0 2 7 0 0 0 0 0 0 0
92 5798 1973 Master Together NaN 0 0 2013-11-23 87 445 37 359 98 28 18 1 2 4 8 1 0 0 0 0 0 0 0
128 8268 1961 PhD Married NaN 0 1 2013-07-11 23 352 0 27 10 0 15 3 6 1 7 6 0 0 0 0 0 0 0
133 1295 1963 Graduation Married NaN 0 1 2013-08-11 96 231 65 196 38 71 124 1 6 5 7 4 0 0 0 0 0 0 0
312 2437 1989 Graduation Married NaN 0 0 2013-06-03 69 861 138 461 60 30 61 1 6 5 12 3 0 1 0 1 0 0 0
319 2863 1970 Graduation Single NaN 1 2 2013-08-23 67 738 20 172 52 50 20 6 2 3 10 7 0 1 0 1 0 0 0
1379 10475 1970 Master Together NaN 0 1 2013-04-01 39 187 5 65 26 20 14 2 4 2 6 5 0 0 0 0 0 0 0
1382 2902 1958 Graduation Together NaN 1 1 2012-09-03 87 19 4 12 2 2 6 1 1 0 3 5 0 0 0 0 0 0 0
1383 4345 1964 2n Cycle Single NaN 1 1 2014-01-12 49 5 1 9 2 0 4 1 1 0 2 7 0 0 0 0 0 0 0
1386 3769 1972 PhD Together NaN 1 0 2014-03-02 17 25 1 13 0 0 3 1 1 0 3 7 0 0 0 0 0 0 0
2059 7187 1969 Master Together NaN 1 1 2013-05-18 52 375 42 48 94 66 96 7 4 10 4 3 0 0 0 0 0 0 0
2061 1612 1981 PhD Single NaN 1 0 2013-05-31 82 23 0 15 0 2 7 2 3 0 3 6 0 0 0 0 0 0 0
2078 5079 1971 Graduation Married NaN 1 1 2013-03-03 82 71 1 16 0 0 9 4 2 1 3 8 0 0 0 0 0 0 0
2079 10339 1954 Master Together NaN 0 1 2013-06-23 83 161 0 22 0 0 24 2 4 1 4 6 0 0 0 0 0 0 0
2081 3117 1955 Graduation Single NaN 0 1 2013-10-18 95 264 0 21 12 6 147 3 6 1 5 7 0 0 0 0 0 0 0
2084 5250 1943 Master Widow NaN 0 0 2013-10-30 75 532 126 490 164 126 126 1 5 5 11 1 0 0 1 0 0 0 1
2228 8720 1978 2n Cycle Together NaN 0 0 2012-08-12 53 32 2 1607 12 4 22 0 0 0 1 0 0 1 0 0 0 0 0

Exploratory Data Analysis (EDA)¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What is the summary statistics of the data? Explore summary statistics for numerical variables and the categorical variables
  2. Find out number of unique observations in each category of categorical columns? Write your findings/observations/insights
  3. Are all categories different from each other or can we combine some categories? Is 2n Cycle different from Master?
  4. There are 8 categories in Marital_Status with some categories having very low count of less than 5. Can we combine these categories with other categories?
In [9]:
def describe_dataframe(df, include_number=True):
    if include_number:
        df_desc = df.describe(include='number').T
    else:
        df_desc = df.describe(exclude='number').T

    df_col = df_desc.index.tolist()

    for column in df_col:
        print(df[column].value_counts())
        print('-' * 50)
        
df.describe(exclude = 'number').T
Out[9]:
count unique top freq
Education 2216 5 Graduation 1116
Marital_Status 2216 8 Married 857
Dt_Customer 2216 662 31-08-2012 12

Observation¶

  1. Education column has a count of 2,216, with 5 unique values and the top value is "Graduation."
  2. Marital_Status column has a count of 2216, with 8 unique values and the top value is "Married."
  3. Dt_Customer column has a count of 2216, with 662 unique values, and the top value is "31-08-2012."
In [10]:
# Replace 2n Cycle for Master
df.loc[df['Education'] == '2n Cycle', 'Education'] = 'Master'

# Grouped small value counts as Other
df.loc[df['Marital_Status'].isin(['Alone', 'Absurd', 'YOLO']), 'Marital_Status'] = 'Single'
In [11]:
df.describe(include = 'number').T
Out[11]:
count mean std min 25% 50% 75% max
ID 2216.0 5588.353339 3249.376275 0.0 2814.75 5458.5 8421.75 11191.0
Year_Birth 2216.0 1968.820397 11.985554 1893.0 1959.00 1970.0 1977.00 1996.0
Income 2216.0 52247.251354 25173.076661 1730.0 35303.00 51381.5 68522.00 666666.0
Kidhome 2216.0 0.441787 0.536896 0.0 0.00 0.0 1.00 2.0
Teenhome 2216.0 0.505415 0.544181 0.0 0.00 0.0 1.00 2.0
Recency 2216.0 49.012635 28.948352 0.0 24.00 49.0 74.00 99.0
MntWines 2216.0 305.091606 337.327920 0.0 24.00 174.5 505.00 1493.0
MntFruits 2216.0 26.356047 39.793917 0.0 2.00 8.0 33.00 199.0
MntMeatProducts 2216.0 166.995939 224.283273 0.0 16.00 68.0 232.25 1725.0
MntFishProducts 2216.0 37.637635 54.752082 0.0 3.00 12.0 50.00 259.0
MntSweetProducts 2216.0 27.028881 41.072046 0.0 1.00 8.0 33.00 262.0
MntGoldProds 2216.0 43.965253 51.815414 0.0 9.00 24.5 56.00 321.0
NumDealsPurchases 2216.0 2.323556 1.923716 0.0 1.00 2.0 3.00 15.0
NumWebPurchases 2216.0 4.085289 2.740951 0.0 2.00 4.0 6.00 27.0
NumCatalogPurchases 2216.0 2.671029 2.926734 0.0 0.00 2.0 4.00 28.0
NumStorePurchases 2216.0 5.800993 3.250785 0.0 3.00 5.0 8.00 13.0
NumWebVisitsMonth 2216.0 5.319043 2.425359 0.0 3.00 6.0 7.00 20.0
AcceptedCmp3 2216.0 0.073556 0.261106 0.0 0.00 0.0 0.00 1.0
AcceptedCmp4 2216.0 0.074007 0.261842 0.0 0.00 0.0 0.00 1.0
AcceptedCmp5 2216.0 0.073105 0.260367 0.0 0.00 0.0 0.00 1.0
AcceptedCmp1 2216.0 0.064079 0.244950 0.0 0.00 0.0 0.00 1.0
AcceptedCmp2 2216.0 0.013087 0.113672 0.0 0.00 0.0 0.00 1.0
Complain 2216.0 0.009477 0.096907 0.0 0.00 0.0 0.00 1.0
Response 2216.0 0.150271 0.357417 0.0 0.00 0.0 0.00 1.0

Observation¶

  1. It would be great to get age column form Year_Birth
  2. It would be useful to crteate total columns: total_kids, total_mnt, total_products, total_accepted_cmp

Univariate Analysis on Numerical and Categorical data¶

Univariate analysis is used to explore each variable in a data set, separately. It looks at the range of values, as well as the central tendency of the values. It can be done for both numerical and categorical variables.

  • Plot histogram and box plot for different numerical features and understand how the data looks like.
  • Explore the categorical variables like Education, Kidhome, Teenhome, Complain.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Leading Questions:

  1. How does the distribution of Income variable vary across the dataset?
  2. The histogram and the box plot are showing some extreme value on the right side of the distribution of the 'Income' feature. Can we consider them as outliers and remove or should we analyze these extreme values?
  3. There are only a few rows with extreme values for the Income variable. Is that enough information to treat (or not to treat) them? At what percentile the upper whisker lies?
In [12]:
cat_cols = df.describe(exclude='number').T.index.tolist()
num_cols = df.describe(include='number').T.index.tolist()
In [13]:
cat_cols
Out[13]:
['Education', 'Marital_Status', 'Dt_Customer']
In [14]:
bar_perc(df, 'Education')

Observation¶

  1. 50% of customers have a Education value of Graduation.
In [15]:
bar_perc(df, 'Kidhome')

Observation¶

  1. 0 and 1 kid home is most common.
In [16]:
bar_perc(df, 'Teenhome')

Observation¶

  1. 0 and 1 teen home is most common.
In [17]:
bar_perc(df, 'Complain')

Observation¶

  1. Most customers did not complain.
In [18]:
bar_perc(df, 'Marital_Status')

Observation¶

  1. 39% of customers are Married.
  2. 65% of customers are Married or Together.
In [19]:
num_cols
Out[19]:
['ID',
 'Year_Birth',
 'Income',
 'Kidhome',
 'Teenhome',
 'Recency',
 'MntWines',
 'MntFruits',
 'MntMeatProducts',
 'MntFishProducts',
 'MntSweetProducts',
 'MntGoldProds',
 'NumDealsPurchases',
 'NumWebPurchases',
 'NumCatalogPurchases',
 'NumStorePurchases',
 'NumWebVisitsMonth',
 'AcceptedCmp3',
 'AcceptedCmp4',
 'AcceptedCmp5',
 'AcceptedCmp1',
 'AcceptedCmp2',
 'Complain',
 'Response']
In [20]:
histogram_boxplot(df.Income, figsize = (15,10), bins = None)
lower_bound:  1730.0
upper_bound 113734.0

Observation¶

  1. Most of the cutomers have a income range of 0 to $114,000.
  2. It seems to be bimodal.
  3. There are some extreme outliers for income and we will analyze them. We can remove them later if new information comes up from the analysis suchc as they are insignificant for business.
In [21]:
df = df[df['Income'] < 200000]
In [22]:
histogram_boxplot(df.Income, figsize = (15,10), bins = None)
lower_bound:  1730.0
upper_bound 113734.0
In [23]:
histogram_boxplot(df.Year_Birth, figsize = (15,10), bins = None)
lower_bound:  1940
upper_bound 1996

Observation¶

  1. A few very old customers
In [24]:
histogram_boxplot(df.NumWebPurchases, figsize = (15,10), bins = None)
lower_bound:  0
upper_bound 11

Observations¶

  1. Most of the customers have fewer Number of Web Purchases.
  2. There are a few individuals who stand out as outliers. In other words, these customers make a significantly larger number of purchases compared to the majority of our customer base.
In [25]:
histogram_boxplot(df.Year_Birth, figsize = (15,10), bins = None)
lower_bound:  1940
upper_bound 1996

Observations¶

  1. There are outliers which might be fake years of birth.
  2. There highest concentration of customers seem to be cborn between 1970-1980.
In [26]:
histogram_boxplot(df.NumCatalogPurchases, figsize = (15,10), bins = None)
lower_bound:  0
upper_bound 10

Observation¶

  1. Most customers have a few or no Catalog purchases.
  2. There are a few customers who make significantly more number of catalog purchases.
In [27]:
histogram_boxplot(df.NumStorePurchases, figsize = (15,10), bins = None)
lower_bound:  0
upper_bound 13

Observation¶

  1. Most of the customers have lower number of store purchases but there are still many customers with higher purchaeses.
In [28]:
histogram_boxplot(df.NumWebVisitsMonth, figsize = (15,10), bins = None)
lower_bound:  0
upper_bound 13

Observations¶

  1. There are a few customers that have significantly more web visits.
  2. Most customers tend to have higher number of web visits.

Overall Observations:¶

  1. The individual columns have good variance which would increase the effectiveness of kmeans or dbscan.
  2. There are outliers, but clusting should be able to handle it. We can revisit if neeed later.

Dimensionality Reduction and Clustering techniques¶

  1. PCA: Help identify key columns for clustering algorithms
  2. Clustering techniques
    • K-means
    • Kmedoid
    • Dendrogram
    • DBSCAN
    • Gaussian Mixture Model

Value of Clusteing Techniques¶

  1. Segmentation: The clustering techniques can help identify distinct groups or segments.
  2. Customer Profiling: Clustering algorithms allow us to create customer profiles based on their behavior, preferences, or other relevant features.
  3. Campaign Targeting: Once you have segmented your customer base using clustering, you can use the resulting groups to gain business insights and target specific marketing campaigns.

Key Measures of success¶

  1. Are the segments generated distinct and meaningful, with clear differentiation from one another? A greater number of distinct groups in the boxplots indicates better differentiation.
  2. Do the segments align with our marketing goals and objectives, and can we take specific marketing actions to target and engage each segment effectively?
  3. What is the size and reach of each segment, and do they have enough potential customers to justify targeted marketing efforts?
  4. How homogeneous are the customers within each segment, and are there significant differences between segments?
  5. At the end of the day, did it increase revenue, reduce cost, or a combination of both?
In [29]:
# describe_dataframe(df, include_number=False)

Bivariate Analysis¶

  • Analyze different categorical and numerical variables and check how different variables are related to each other.
  • Check the relationship of numerical variables with categorical variables.
In [30]:
plt.figure(figsize = (16, 16))

sns.heatmap(df.corr(), annot = True, fmt = '0.2f') # Shows correlation between variables.

plt.show()
Observation¶
  1. Strong correlation between NumCatalogPurchases and MntMeatProducts.
  2. Similarly with NumStorePurchases and MntWines
In [31]:
# Relationship between food_preparation_time and cuisine type
plt.figure(figsize=(15,7))
sns.boxplot(x = "Education", y = "Income", data = df, palette = 'PuBu')
plt.xticks(rotation = 60)
plt.show()

Observation¶

  1. Basic education makes the least income.
In [32]:
# Relationship between food_preparation_time and cuisine type
plt.figure(figsize=(15,7))
sns.boxplot(x = "Marital_Status", y = "Income", data = df, palette = 'PuBu')
plt.xticks(rotation = 60)
plt.show()

Observation¶

  1. Widow has the inner quartile with the highest income.
  2. No clear diffrentiation.

Feature Engineering and Data Processing¶

In this section, we will first prepare our dataset for analysis.

  • Imputing missing values

Think About It:

  • Can we extract the age of each customer and create a new feature?
  • Can we find the total kids and teens in the home?
  • Can we find out how many members each family has?
  • Can we find the total amount spent by the customers on various products?
  • Can we find out how long the customer has been with the company?
  • Can we find out how many offers the customers have accepted?
  • Can we find out amount spent per purchase?
  1. Yes, you can create ranges as categorical values
  2. Yes, create a column that sums both
  3. Imidiate family yes, convert marital status to numbers using conditions and add total kids in home
  4. Yes, create a total column that sums amounts
  5. Yes, create a total offers column summing the accepted offers columns.
  6. Not sure.
  7. Yes, but it would be an average.
In [33]:
df["Age"] = 2016 - df.Year_Birth 
In [34]:
df.drop(df[df["Age"] > 115].index, inplace=True)
In [35]:
histogram_boxplot(df.Age, figsize = (15,10), bins = None)
lower_bound:  20
upper_bound 76
In [36]:
df["Kids"] = df.Teenhome + df.Kidhome
In [37]:
df.Marital_Status.unique().tolist()
Out[37]:
['Single', 'Together', 'Married', 'Divorced', 'Widow']
In [38]:
df["Marital_Status"].replace(["Married", "Together"], "Relationship", inplace=True)
df["Marital_Status"].replace(["Divorced", "Widow"], "Single", inplace=True)
In [39]:
df.Marital_Status.value_counts()
Out[39]:
Relationship    1428
Single           784
Name: Marital_Status, dtype: int64
In [40]:
df["Status"] = df["Marital_Status"].replace({"Single": 1, "Relationship": 2})
In [41]:
df["Family_Size"] = df["Status"] + df["Kids"]
In [42]:
df["Expenses"] = df[['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']].sum(axis=1)
In [43]:
df["NumTotalPurchases"] = df[['NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases']].sum(axis=1)
In [44]:
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], format="%d-%m-%Y")
In [45]:
df["Dt_Customer"].min()
Out[45]:
Timestamp('2012-07-30 00:00:00')
In [46]:
df["Dt_Customer"].max()
Out[46]:
Timestamp('2014-06-29 00:00:00')
In [47]:
 # Assigning date to the day variable
df["day"] = "01-01-2015"                         

# Converting the variable day to Python datetime object
df["day"] = pd.to_datetime(df.day)     
In [48]:
df["Engaged_in_days"] = (df["day"] - df["Dt_Customer"]).dt.days     
In [49]:
df["TotalAcceptedCmp"] = df[["AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3", "AcceptedCmp4", "AcceptedCmp5", "Response"]].sum(axis=1)
In [50]:
df['AmountPerPurchase'] = df['Expenses'] / df['NumTotalPurchases']
In [51]:
df['AmountPerPurchase'].max()
Out[51]:
inf
In [52]:
df[df['NumTotalPurchases'] == 0].shape[0]
Out[52]:
4
In [53]:
df.drop(df[df['NumTotalPurchases'] == 0].index, inplace=True)
In [54]:
histogram_boxplot(df.AmountPerPurchase, figsize = (15,10), bins = None)
lower_bound:  0.5333333333333333
upper_bound 98.91666666666667

Observation¶

  1. There are a few sales above $100

Important Insights from EDA and Data Preprocessing¶

What are the the most important observations and insights from the data based on the EDA and Data Preprocessing performed?

  1. Outliers had to be removed
  2. Grouped some values that should be grouped but had different names or values.

Data Preparation for Segmentation¶

  • The decision about which variables to use for clustering is a critically important decision that will have a big impact on the clustering solution. So we need to think carefully about the variables we will choose for clustering. Clearly, this is a step where a lot of contextual knowledge, creativity, and experimentation/iterations are needed.
  • Moreover, we often use only a few of the data attributes for segmentation (the segmentation attributes) and use some of the remaining ones (the profiling attributes) only to profile the clusters. For example, in market research and market segmentation, we can use behavioral data for segmentation (to segment the customers based on their behavior like amount spent, units bought, etc.), and then use both demographic as well as behavioral data for profiling the segments found.
  • Plot the correlation plot after we've removed the irrelevant variables
  • Scale the Data
In [55]:
# Dropping all the irrelevant columns and storing in data_model
data_model = df.drop(
    columns=[
        "ID",
        "Year_Birth",
        "Dt_Customer",
        "day",
        "Complain",
        "Response",
        "AcceptedCmp1",
        "AcceptedCmp2",
        "AcceptedCmp3",
        "AcceptedCmp4",
        "AcceptedCmp5",
        "Marital_Status",
        "Status",
        "Kids",
        'Education',
        'Kidhome',
        'Teenhome', 'Income','Age', 'Family_Size'
    ],
    axis=1,
)
In [56]:
df.shape
Out[56]:
(2208, 37)
In [57]:
data_model.shape
Out[57]:
(2208, 17)
In [58]:
data_model.head()
Out[58]:
Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
0 58 635 88 546 172 88 88 3 8 10 4 7 1617 25 849 1 64.680000
1 38 11 1 6 2 1 6 2 1 1 2 5 27 6 299 0 4.500000
2 26 426 49 127 111 21 42 1 8 2 10 4 776 21 498 0 36.952381
3 26 11 4 20 10 3 5 2 2 0 4 6 53 8 325 0 6.625000
4 94 173 43 118 46 27 15 5 5 3 6 5 422 19 347 0 22.210526
In [59]:
from sklearn.preprocessing import StandardScaler


# Scaling the data and storing the output as a new DataFrame
scaler = StandardScaler()
numerical_columns = df.select_dtypes(include=[np.number])

data_scaled_all = pd.DataFrame(scaler.fit_transform(numerical_columns), columns = numerical_columns.columns)

data_scaled_all.head()
Out[59]:
ID Year_Birth Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Kids Status Family_Size Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
0 -0.017526 -1.018743 0.292925 -0.823235 -0.931270 0.311060 0.975863 1.550053 1.688262 2.451041 1.481762 0.850833 0.348987 1.425749 2.501580 -0.560167 0.698394 -0.282324 -0.283258 -0.280449 -0.262168 -0.115364 -0.095607 2.372895 1.018743 -1.266516 -1.351719 -1.761762 1.674434 1.317746 1.526655 0.617845 1.120445
1 -1.049811 -1.275141 -0.261644 1.039166 0.906303 -0.380080 -0.874099 -0.638273 -0.719275 -0.651903 -0.634855 -0.734941 -0.170961 -1.130379 -0.572908 -1.176741 -0.132230 -0.282324 -0.283258 -0.280449 -0.262168 -0.115364 -0.095607 -0.421426 1.275141 1.403143 -1.351719 0.447193 -0.965102 -1.165726 -1.190250 -0.504032 -0.986284
2 -0.443690 -0.335013 0.926537 -0.823235 -0.931270 -0.794765 0.356244 0.569079 -0.179808 1.337632 -0.148276 -0.038747 -0.690910 1.425749 -0.231298 1.289556 -0.547542 -0.282324 -0.283258 -0.280449 -0.262168 -0.115364 -0.095607 -0.421426 0.335013 -1.266516 0.739799 -0.657284 0.278302 0.794910 -0.207224 -0.504032 0.149781
3 0.185233 1.288844 -1.187869 1.039166 -0.931270 -0.794765 -0.874099 -0.562814 -0.656857 -0.505882 -0.586197 -0.754280 -0.170961 -0.765218 -0.914518 -0.560167 0.283082 -0.282324 -0.283258 -0.280449 -0.262168 -0.115364 -0.095607 -0.421426 -1.288844 0.068313 0.739799 0.447193 -0.921940 -0.904308 -1.061814 -0.504032 -0.911894
4 -0.079155 1.032445 0.300213 1.039166 -0.931270 1.555113 -0.393821 0.418160 -0.219934 0.151212 -0.002303 -0.560893 1.388885 0.330266 0.110311 0.056408 -0.132230 -0.282324 -0.283258 -0.280449 -0.262168 -0.115364 -0.095607 -0.421426 -1.032445 0.068313 0.739799 0.447193 -0.309368 0.533492 -0.953138 -0.504032 -0.366289
In [60]:
# Scaling the data and storing the output as a new DataFrame

scaler = StandardScaler()

data_scaled = pd.DataFrame(scaler.fit_transform(data_model), columns = data_model.columns)

data_scaled.head()
Out[60]:
Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
0 0.311060 0.975863 1.550053 1.688262 2.451041 1.481762 0.850833 0.348987 1.425749 2.501580 -0.560167 0.698394 1.674434 1.317746 1.526655 0.617845 1.120445
1 -0.380080 -0.874099 -0.638273 -0.719275 -0.651903 -0.634855 -0.734941 -0.170961 -1.130379 -0.572908 -1.176741 -0.132230 -0.965102 -1.165726 -1.190250 -0.504032 -0.986284
2 -0.794765 0.356244 0.569079 -0.179808 1.337632 -0.148276 -0.038747 -0.690910 1.425749 -0.231298 1.289556 -0.547542 0.278302 0.794910 -0.207224 -0.504032 0.149781
3 -0.794765 -0.874099 -0.562814 -0.656857 -0.505882 -0.586197 -0.754280 -0.170961 -0.765218 -0.914518 -0.560167 0.283082 -0.921940 -0.904308 -1.061814 -0.504032 -0.911894
4 1.555113 -0.393821 0.418160 -0.219934 0.151212 -0.002303 -0.560893 1.388885 0.330266 0.110311 0.056408 -0.132230 -0.309368 0.533492 -0.953138 -0.504032 -0.366289

Applying T-SNE and PCA to the data to visualize the data distributed in 2 dimensions¶

Applying T-SNE¶

In [61]:
from sklearn.manifold import TSNE

#  t-SNE is a non-linear dimensionality reduction technique that focuses on preserving the local structure of the data
# Apply the t-SNE algorithm with random_state = 1
tsne = TSNE(n_components = 2, random_state = 1)
# Fit and transform t-SNE function on the scaled data
data_tsne_all = tsne.fit_transform(data_scaled_all) # Complete the code   

data_tsne_all = pd.DataFrame(data = data_tsne_all, columns = ['Component 1', 'Component 2'])
In [62]:
mkt_tsne_all = tsne.fit_transform(data_scaled_all)

# Converting the embeddings to a dataframe
mkt_tsne_all = pd.DataFrame(mkt_tsne_all, columns = ["Component 1", "Component 2"])

# Scatter plot for two components
plt.figure(figsize = (7, 7))

sns.scatterplot(x = data_tsne_all.iloc[:,0], y = data_tsne_all.iloc[:,1])

plt.show()

Observation: Show some clear clusters when no columns are removed.

In [63]:
from sklearn.manifold import TSNE

#  t-SNE is a non-linear dimensionality reduction technique that focuses on preserving the local structure of the data
# Apply the t-SNE algorithm with random_state = 1
tsne = TSNE(n_components = 2, random_state = 1)
# Fit and transform t-SNE function on the scaled data
data_tsne = tsne.fit_transform(data_scaled) # Complete the code   

data_tsne = pd.DataFrame(data = data_tsne, columns = ['Component 1', 'Component 2'])
In [64]:
mkt_tsne = tsne.fit_transform(data_scaled)

# Converting the embeddings to a dataframe
mkt_tsne = pd.DataFrame(mkt_tsne, columns = ["Component 1", "Component 2"])

# Scatter plot for two components
plt.figure(figsize = (7, 7))

sns.scatterplot(x = data_tsne.iloc[:,0], y = data_tsne.iloc[:,1])

plt.show()

Observation: When demographic columns are reduced, it brings the samples closer together. Looks more like one big cluster vs man

Applying PCA¶

Think about it:

  • Should we apply clustering algorithms on the current data or should we apply PCA on the data before applying clustering algorithms? How would this help?

Answer: PCA can help reduce the collinearity between variables and provide more meaningful clustering insights.

In [65]:
plt.figure(figsize = (16, 16))

sns.heatmap(data_model.corr(), annot = True, fmt = '0.2f') # Shows correlation between variables.

plt.show()

Observation:

  1. Number of web visits per month has a strong negative correlation between expenses and number of total purchases.
  2. Calculated coluns such as number of total puchaes and expenses have a strong correlation with the individual amounts of items purchased.
In [66]:
# Importing PCA and t-SNE
from sklearn.decomposition import PCA

# PCA helps with dimensionality reduction.
# Defining the number of principal components to generate 
n = data_scaled.shape[1]

# Finding principal components for the data
# Apply the PCA algorithm with random_state = 1
pca = PCA(n_components=n, random_state=1) # Complete the code 

# Fit and transform the pca function on scaled data
data_pca1 = pd.DataFrame(pca.fit_transform(data_scaled)) # Complete the code    

# The percentage of variance explained by each principal component
exp_var = pca.explained_variance_ratio_

data_pca_copy = pd.DataFrame(data_pca1, columns = df.columns)
In [67]:
exp_var
Out[67]:
array([4.48668019e-01, 1.16533332e-01, 6.94607714e-02, 6.02872096e-02,
       5.78382351e-02, 4.16408241e-02, 3.79264953e-02, 3.54636976e-02,
       2.76229835e-02, 2.52984813e-02, 2.31137962e-02, 2.30471218e-02,
       1.58105413e-02, 1.44767616e-02, 2.81172999e-03, 6.80243099e-33,
       1.75106629e-33])
In [68]:
# Visualize the explained variance by individual components
plt.figure(figsize = (10, 10))

plt.plot(range(1, 18), exp_var.cumsum(), marker = 'o', linestyle = '--')

plt.title("Explained Variances by Components")

plt.xlabel("Number of Components")

plt.ylabel("Cumulative Explained Variance")

plt.show()
In [69]:
np.cumsum(exp_var)
Out[69]:
array([0.44866802, 0.56520135, 0.63466212, 0.69494933, 0.75278757,
       0.79442839, 0.83235489, 0.86781858, 0.89544157, 0.92074005,
       0.94385385, 0.96690097, 0.98271151, 0.99718827, 1.        ,
       1.        , 1.        ])

Observation¶

  1. Nearly 80% of the variance can be explained by 6 components.
In [70]:
pc_comps = ['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6']

data_pca = pd.DataFrame(np.round(pca.components_[:6,:], 2), index = pc_comps, columns = data_scaled.columns) # Visialize PC coefficients.
In [71]:
def color_high(val):
    if val <= -0.40:
        return 'background: pink'
    
    elif val >= 0.40:
        return 'background: skyblue'   
    
data_pca.T.style.applymap(color_high)
Out[71]:
  PC1 PC2 PC3 PC4 PC5 PC6
Recency 0.010000 0.000000 -0.320000 0.540000 -0.750000 -0.120000
MntWines 0.300000 0.130000 0.290000 -0.050000 -0.220000 0.030000
MntFruits 0.250000 -0.150000 -0.230000 0.120000 0.280000 -0.060000
MntMeatProducts 0.300000 -0.150000 0.060000 0.080000 -0.040000 0.330000
MntFishProducts 0.260000 -0.160000 -0.220000 0.140000 0.260000 -0.050000
MntSweetProducts 0.250000 -0.130000 -0.220000 0.130000 0.220000 -0.020000
MntGoldProds 0.210000 0.100000 -0.110000 0.120000 0.160000 -0.710000
NumDealsPurchases -0.020000 0.550000 -0.170000 -0.140000 -0.020000 0.200000
NumWebPurchases 0.210000 0.380000 -0.060000 -0.170000 -0.030000 -0.300000
NumCatalogPurchases 0.300000 -0.020000 0.020000 -0.050000 -0.110000 0.210000
NumStorePurchases 0.270000 0.130000 -0.190000 -0.260000 -0.080000 0.100000
NumWebVisitsMonth -0.200000 0.410000 0.170000 0.250000 0.140000 -0.120000
Expenses 0.350000 -0.010000 0.130000 0.040000 -0.070000 0.070000
NumTotalPurchases 0.300000 0.320000 -0.140000 -0.220000 -0.090000 0.070000
Engaged_in_days 0.060000 0.370000 0.100000 0.610000 0.330000 0.330000
TotalAcceptedCmp 0.150000 -0.000000 0.680000 0.050000 -0.090000 -0.230000
AmountPerPurchase 0.320000 -0.120000 0.200000 0.140000 -0.030000 0.010000

Observation and Insights:

  1. PC1: No strong influential variables.
  2. PC2: NumDealsPurchases, and NumWebVisitsMonth
  3. PC3: TotalAcceptedCmp
  4. PC4: Recency and engaged days
In [72]:
# df_model.head()

K-Means¶

Think About It:

  • How do we determine the optimal K value from the elbow curve?
  • Which metric can be used to determine the final K value?
In [73]:
# Importing clustering algorithms
from sklearn.cluster import KMeans

#  K-Means is a distance-based clustering algorithm that aims to minimize the within-cluster sum of squares
# step 1
sse = {} 

# step 2 - iterate for a range of Ks and fit the pca components to the algorithm. 
for k in range(1, 10):
    kmeans = KMeans(n_clusters = k, max_iter = 1000, random_state = 1).fit(data_pca1)
    sse[k] = kmeans.inertia_     # Use inertia attribute from the clustering object and store the inertia value for that K

# step 3
plt.figure()

plt.plot(list(sse.keys()), list(sse.values()), 'bx-')

plt.xlabel("Number of cluster")

plt.ylabel("SSE")

plt.show()

Observation:¶

  1. Based on this elbow plot, the k should be 4. It suggests that dividing the data into four clusters would provide a reasonable trade-off between minimizing SSE and ensuring meaningful cluster assignments.
In [74]:
sil_score = []                                                             # Creating empty list
cluster_list = range(3, 7)                                                 # Creating a range from 3 to 7
for n_clusters in cluster_list:
    
    # Initialize K-Means with number of clusters equal to n_clusters and random_state=1
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)
    
    # Fit and predict on the pca data
    preds = clusterer.fit_predict(data_pca1)
    
    # Calculate silhouette score - Hint: Use silhouette_score() function
    score = silhouette_score(data_pca1, preds)
    
    # Append silhouette score to empty list created above
    sil_score.append(score)
    
    # Print the silhouette score
    print("For n_clusters = {}, the silhouette score is {}".format(n_clusters, score))
For n_clusters = 3, the silhouette score is 0.27201384838171144
For n_clusters = 4, the silhouette score is 0.25661259468245723
For n_clusters = 5, the silhouette score is 0.23583061465612465
For n_clusters = 6, the silhouette score is 0.13010333058878457

Applying KMeans on the PCA data and visualize the clusters¶

In [75]:
kmeans = KMeans(n_clusters = 3, random_state = 1)

kmeans.fit(data_pca1)
Out[75]:
KMeans(n_clusters=3, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KMeans(n_clusters=3, random_state=1)
In [76]:
data_pca1["K_means_segments_3"] = kmeans.labels_                    # Adding K-Means cluster labels to the data_pca data

df["K_means_segments_3"] = kmeans.labels_                        # Adding K-Means cluster labels to the whole data

data_scaled["K_means_segments_3"] = kmeans.labels_                  # Adding K-Means cluster labels to data_scaled
In [77]:
silhouette_avg = silhouette_score(data_pca1, kmeans.labels_)
print("The silhouette score for data_pca1 is:", silhouette_avg)
The silhouette score for data_pca1 is: 0.29586573188903226
In [78]:
# Df stores methods and silhouettte score
results = []

# Append method and score to the results list
results.append({"Method": 'Kmeans_3', "Score": silhouette_avg})
In [79]:
print(results)
[{'Method': 'Kmeans_3', 'Score': 0.29586573188903226}]
In [80]:
# Function to visualize PCA data with clusters formed
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=1, data=PCA, hue=cluster)
In [81]:
PCA_PLOT(0, 1, data_pca1, "K_means_segments_3")

Observation:¶

  1. Three distinct clusters.

Cluster Profiling¶

In [82]:
cluster_profile_KMeans_3 = data_pca1.groupby('K_means_segments_3').mean()
cluster_profile_KMeans_3.style.highlight_max(color="lightgreen", axis=0)
Out[82]:
  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
K_means_segments_3                                  
0 0.768588 1.374578 -0.211799 -0.431943 -0.153868 -0.107772 -0.153745 -0.156302 0.040737 -0.044713 -0.005825 -0.024341 0.061141 -0.020934 0.071660 0.000000 0.000000
1 -2.484489 -0.360640 0.076136 0.134909 0.050476 0.014695 0.036093 0.046186 0.005626 -0.002559 -0.000593 -0.000505 -0.026724 0.014590 -0.035199 0.000000 0.000000
2 3.772440 -0.846935 0.092243 0.225994 0.076007 0.091663 0.102623 0.086674 -0.055443 0.054129 0.007534 0.027816 -0.017803 -0.004023 -0.013654 0.000000 0.000000

Observation¶

  1. Cluster have good variance of mean.
In [83]:
# Columns to use in boxplot
# col_for_box = ['Income','Kidhome','Teenhome','Recency','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts','MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases','NumWebVisitsMonth','Complain','Age','Family_Size','Expenses','NumTotalPurchases','Engaged_in_days','TotalAcceptedCmp','AmountPerPurchase']
In [84]:
# Creating boxplot for each of the variables
all_col = ['Income','NumDealsPurchases','NumWebVisitsMonth', 'NumWebPurchases','TotalAcceptedCmp', 'Expenses','NumTotalPurchases']

plt.figure(figsize = (20, 50))

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=df[variable], x=df['K_means_segments_3'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()

Describe the characteristics of each cluster¶

Think About It:

  • Are the K-Means profiles providing any deep insights into customer purchasing behavior or which channels they are using? There are many outlies which increases the chances of misclassifying customers. While there is unclear cluster distinction for some columns, for the most part there is good insights into customer behavior and the channels they are using.
  • What is the next step to get more meaningful insights? Try K-means with 5 clusters.

Summary of each cluster:

Cluster 0:

  • Moderate income, moderate-high NumDealPurchases
  • Highest NumWeVisitsMonth, highest NumWebPurchases
  • Low number of accepted campaigns
  • Moderate expenses, moderate-high NumTotalPurchases

Cluster 1:

  • Lowest income, low NumDealsPurchases
  • Highest NumWebVisitsMonth, lowest NumwebPurchases
  • Lowest Total accepted campaigns
  • Lowest expenses, lowest NumTotalPurchases

Cluster 2:

  • Highest income, lowest NumDealsPurchases
  • Lowest NumWebVisitsMonth, moderate NumwebPurchases
  • Highest Total accepted campaigns
  • High expenses, moderate-high NumTotalPurchases
In [85]:
# Drop columns
data_pca1.drop("K_means_segments_3", axis=1, inplace=True)
df.drop("K_means_segments_3", axis=1, inplace=True)
In [86]:
kmeans = KMeans(n_clusters = 5, random_state = 1)

kmeans.fit(data_pca1)
Out[86]:
KMeans(n_clusters=5, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KMeans(n_clusters=5, random_state=1)
In [87]:
data_pca1["K_means_segments_5"] = kmeans.labels_  # Adding K-Means cluster labels to the data_pca data

df["K_means_segments_5"] = kmeans.labels_     # Adding K-Means cluster labels to the whole data

data_scaled["K_means_segments_5"] = kmeans.labels_  # Adding K-Means cluster labels to data_scaled
In [88]:
silhouette_avg = silhouette_score(data_pca1, kmeans.labels_)
print("The silhouette score for data_pca1 is:", silhouette_avg)
The silhouette score for data_pca1 is: 0.2894335595121166
In [89]:
# Append method and score to the results list
results.append({"Method": 'Kmeans_5', "Score": silhouette_avg})
print(results)
[{'Method': 'Kmeans_3', 'Score': 0.29586573188903226}, {'Method': 'Kmeans_5', 'Score': 0.2894335595121166}]
In [90]:
PCA_PLOT(0, 1, data_pca1, "K_means_segments_5")
In [91]:
cluster_profile_KMeans_5 = data_pca1.groupby('K_means_segments_5').mean()
cluster_profile_KMeans_5.style.highlight_max(color="lightgreen", axis=0)
Out[91]:
  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
K_means_segments_5                                  
0 -2.552404 -0.478570 0.081441 0.135359 0.045405 0.002666 0.011847 0.021778 0.010197 0.000036 0.002878 -0.007709 -0.036259 0.013451 -0.046011 0.000000 0.000000
1 1.578934 0.442909 -0.504456 -0.600998 -0.192214 -0.053760 -0.457999 -0.406103 0.090076 -0.030772 0.089143 0.036990 -0.091341 -0.018903 0.081260 0.000000 0.000000
2 -0.285200 2.021514 -0.019120 -0.139724 -0.035601 -0.052136 0.290267 0.183086 -0.051196 -0.013738 -0.073046 -0.015239 0.215441 -0.020764 0.078775 0.000000 0.000000
3 4.155756 -0.297457 2.260443 0.280019 -0.626976 -0.155703 -0.144649 0.378139 0.248659 -0.151026 -0.074191 0.021742 0.078000 -0.032695 -0.060042 0.000000 0.000000
4 3.850381 -1.108559 -0.690979 0.342767 0.437119 0.186212 0.284502 0.054486 -0.204137 0.123750 -0.004193 -0.016664 -0.046377 0.020934 -0.017435 0.000000 0.000000
In [92]:
# Creating boxplot for each of the variables

plt.figure(figsize = (20, 50))

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=df[variable], x=df['K_means_segments_5'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()

Observation¶

  1. K=5 provides better insights for some columns like family size and others but adds a level of complexity to interpretation.

Cluster 0: Customers in this group have a moderate income and don't frequently take advantage of deals. They visit websites a fair amount and make a moderate to high number of online purchases. They also accept a low number of marketing campaigns. Their total purchases are moderate, and each purchase tends to be for a relatively low to moderate amount.

  • Lowest income, low number of deal purchases
  • Moderate-high number of web visits, lowest number of web purchases
  • Lowest number of total accepted campaigns
  • Lowest expenses, lowest total purchases

Cluster 1: Customers in this group have a moderate income and don't frequently take advantage of deals. They visit websites a fair amount and make a moderate to high number of online purchases. They also accept a low number of marketing campaigns. Their total purchases are moderate, and each purchase tends to be for a relatively low to moderate amount.

  • Moderate income, low number of deal purchases
  • Moderate number of web visits, moderate-high number of web purchases
  • Lowest number of total accepted campaigns
  • Moderate expenses, moderate total purchasese

Cluster 2: With a moderate income, these customers make the most deal purchases. They visit websites frequently and make a moderate number of online purchases. They don't accept many marketing campaigns. Their total number of purchases is moderate, and each purchase tends to be for a relatively low amount.

  • Moderate income, highest number of deal purchases
  • High number of web visits, moderate web purchases
  • Low total accepted campaigns
  • Lowest expenses, moderate total purchases

Cluster 3: These customers have a high income but make the lowest number of purchases when there are deals. They have a low to moderate number of web visits but make a moderate number of web purchases. They accept the highest number of marketing campaigns. Their total number of purchases is moderate, and each purchase tends to be for a moderate to high amount.

  • High income, lowest deal purchases
  • Low-moderate number of web visits, moderate number of web purchases
  • Highest total accepted campaigns
  • Highest expenses, moderate total purchases

Cluster 4: Customers in this group have a high income but make the lowest number of deal purchases. They have the lowest web visitation and make a low to moderate number of web purchases. They don't accept many marketing campaigns. Their total number of purchases is moderate, and each purchase tends to be for a moderate amount.

  • High income, lowest number of deal purchases
  • Lowest web visits, low-moderate web purchases
  • Low accepted campaigns
  • High expenses, moderate total purchases

K-Medoids¶

In [93]:
from sklearn_extra.cluster import KMedoids

# Creating copy of the data to store labels from each algorithm
# data_scaled_copy = data_k_pca.copy(deep = True)

for n_clusters in range(2, 5):
    # Initialize K-Medoids with the specified number of clusters
    kmedo = KMedoids(n_clusters=n_clusters, random_state=1)
    kmedo.fit(data_pca1)

    # Predict cluster labels for the data points
    df[f'kmedoLabels_{n_clusters}'] = kmedo.predict(data_pca1)

    # Calculate silhouette score
    silhouette_avg = silhouette_score(data_pca1, kmedo.labels_)
    results.append({"Method": f'Medioids_{n_clusters}', "Score": silhouette_avg})
    print(f"The silhouette score for data_pca1 with {n_clusters} clusters is: {silhouette_avg}")
The silhouette score for data_pca1 with 2 clusters is: 0.3127386542698528
The silhouette score for data_pca1 with 3 clusters is: 0.12521679142543043
The silhouette score for data_pca1 with 4 clusters is: 0.11421820199844311
In [94]:
# Print silhouette scores
print(results)
[{'Method': 'Kmeans_3', 'Score': 0.29586573188903226}, {'Method': 'Kmeans_5', 'Score': 0.2894335595121166}, {'Method': 'Medioids_2', 'Score': 0.3127386542698528}, {'Method': 'Medioids_3', 'Score': 0.12521679142543043}, {'Method': 'Medioids_4', 'Score': 0.11421820199844311}]

Observation:

  1. The highest silhouette score is achieved when using two clusters for K-medoids. However, it is important to note that these clusters may be too general and may not capture all the significant insights.

Visualize the clusters using PCA¶

In [95]:
plt.figure(figsize=(20, 20))


for i, col in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    sns.boxplot(x='kmedoLabels_2', y=col, data=df)
    plt.tight_layout()
    
    plt.title(col)

plt.show()

Observation:

  1. Probably too genral for any meaninful insights.
In [96]:
plt.figure(figsize=(20, 20))


for i, col in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    sns.boxplot(x='kmedoLabels_3', y=col, data=df)
    plt.tight_layout()
    
    plt.title(col)

plt.show()

Observation:

  1. Clusters are not too clear and distinct. Will skip kmedoids_5 given it has a lower silhouette score.

Cluster Profiling¶

Observations and Insights: Overall, the clusters don't provide many insights as there is not enough variance between clusters.

Characteristics of each cluster¶

Cluster 0: Highest income, highest number of web visits

Cluster 1: Similar to cluster 2

Cluster 2: Too similar to cluster 1

Observations and Insights: Not very insightful from K-mediod clusters.

Hierarchical Clustering¶

  • Find the Cophenetic correlation for different distances with different linkage methods.
  • Create the dendrograms for different linkages
  • Explore different linkages with each distance metric
In [97]:
# Drop columns
data_pca1.drop("K_means_segments_5", axis=1, inplace=True)
df.drop("K_means_segments_5", axis=1, inplace=True)
In [98]:
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]

# list of linkage methods
linkage_methods = ["single", "complete", "average"]

high_cophenet_corr = 0                                                 # Creating a variable by assigning 0 to it
high_dm_lm = [0, 0]                                                    # Creating a list by assigning 0's to it

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(data_pca1, metric=dm, method=lm)                    # Applying different linkages with different distance on data_pca
        c, coph_dists = cophenet(Z, pdist(data_pca1))                   # Calculating cophenetic correlation
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:                                     # Checking if cophenetic correlation is higher than previous score
            high_cophenet_corr = c                                     # Appending to high_cophenet_corr list if it is higher
            high_dm_lm[0] = dm                                         # Appending its corresponding distance
            high_dm_lm[1] = lm                                         # Appending its corresponding method or linkage
Cophenetic correlation for Euclidean distance and single linkage is 0.8129807227587553.
Cophenetic correlation for Euclidean distance and complete linkage is 0.7537524321636122.
Cophenetic correlation for Euclidean distance and average linkage is 0.8151761018960105.
Cophenetic correlation for Chebyshev distance and single linkage is 0.8051749754976828.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.6813736478845351.
Cophenetic correlation for Chebyshev distance and average linkage is 0.774784181804776.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.5376203582123958.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.3693232898614882.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.5881693054344688.
Cophenetic correlation for Cityblock distance and single linkage is 0.8149845035005294.
Cophenetic correlation for Cityblock distance and complete linkage is 0.7627696622102742.
Cophenetic correlation for Cityblock distance and average linkage is 0.8411010987060117.
In [99]:
# Printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
    "Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)
Highest cophenetic correlation is 0.8411010987060117, which is obtained with Cityblock distance and average linkage.
In [100]:
# List of linkage methods
linkage_methods = ["single", "complete", "average"]

# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))            # Setting the plot size

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca1, metric="Cityblock", method=method)                  # Measures the distances between two clusters

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")           # Title of dendrogram

    coph_corr, coph_dist = cophenet(Z, pdist(data_pca1))                       # Finding cophenetic correlation for different linkages with city block distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

Think about it:

  • Can we clearly decide the number of clusters based on where to cut the dendrogram horizontally?
  • What is the next step in obtaining number of clusters based on the dendrogram?
  • Are there any distinct clusters in any of the dendrograms?

Observation¶

  1. Complete linkage is easiest to interpret.
In [101]:
# List of linkage methods
linkage_methods = ["single", "complete", "average"]

# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))  # Setting the plot size

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca1, metric="chebyshev", method=method)  # Measures the distances between two clusters

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")  # Title of dendrogram0
    coph_corr, coph_dist = cophenet(Z, pdist(data_pca1))  # Finding cophenetic correlation for different linkages with Chebyshev distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

# Show the plot
plt.tight_layout()
plt.show()

Observation¶

Complete linkage generates better separated clusters.

In [102]:
plt.figure(figsize = (20, 7))  

plt.title("Dendrograms")  

dend = dendrogram(linkage(data_scaled, method = 'complete'))

plt.axhline(y = 15, color = 'r', linestyle = '--')
Out[102]:
<matplotlib.lines.Line2D at 0x15a50e09970>
In [103]:
# data_pca1 = data_pca1.drop('dendrogram', axis=1)

# data_pca1.head()

Visualize the clusters using PCA¶

In [104]:
HCmodel = AgglomerativeClustering(n_clusters=4, affinity='euclidean', linkage='ward')
HCmodel.fit(data_pca1)
Out[104]:
AgglomerativeClustering(affinity='euclidean', n_clusters=4)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AgglomerativeClustering(affinity='euclidean', n_clusters=4)
In [105]:
data_pca1["dendrogram"] = HCmodel.labels_  # Adding K-Means cluster labels to the data_pca data

df["dendrogram"] = HCmodel.labels_     # Adding K-Means cluster labels to the whole data

data_scaled["dendrogram"] = HCmodel.labels_  # Adding K-Means cluster labels to data_scaled
In [106]:
silhouette_avg = silhouette_score(data_pca1, HCmodel.labels_)
print("The silhouette score for data_pca1 is:", silhouette_avg)
The silhouette score for data_pca1 is: 0.22837691444799138
In [107]:
# Append method and score to the results list
results.append({"Method": 'Dendrogram', "Score": silhouette_avg})
print(results)
[{'Method': 'Kmeans_3', 'Score': 0.29586573188903226}, {'Method': 'Kmeans_5', 'Score': 0.2894335595121166}, {'Method': 'Medioids_2', 'Score': 0.3127386542698528}, {'Method': 'Medioids_3', 'Score': 0.12521679142543043}, {'Method': 'Medioids_4', 'Score': 0.11421820199844311}, {'Method': 'Dendrogram', 'Score': 0.22837691444799138}]
In [108]:
PCA_PLOT(0, 1, data_pca1, "dendrogram")

Cluster Profiling¶

In [109]:
# Creating boxplot for each of the variables

plt.figure(figsize = (20, 50))

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=df[variable], x=df['dendrogram'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()
In [110]:
df.dendrogram.value_counts()
Out[110]:
1    928
0    591
2    450
3    239
Name: dendrogram, dtype: int64

Observations and Insights:

  1. Some columns have good diffrentiation while others overlap.
  2. They also have notable outliers that significantly deviate from the central distribution.

Characteristics of each cluster¶

Summary of each cluster:

Cluster 0:

  • Highest income, lowest NumDealPurchases
  • Lowest NumWeVisitsMonth, moderate NumWebPurchases
  • Low-moderate number of accepted campaigns
  • High expenses, moderate NumTotalPurchases

Cluster 1:

  • Lowest income, low NumDealPurchases
  • Moderate NumWeVisitsMonth, low NumWebPurchases
  • Lowest number of accepted campaigns
  • Lowest expenses, lowest NumTotalPurchases

Cluster 2:

  • Moderate income, moderate-high NumDealPurchases
  • Moderate NumWeVisitsMonth, moderate NumWebPurchases
  • Low number of accepted campaigns
  • Low expenses, moderate NumTotalPurchases

Cluster 3:

  • Moderate-high income, moderate NumDealPurchases
  • Moderate NumWeVisitsMonth, moderate-high NumWebPurchases
  • Low number of accepted campaigns
  • Moderate-high expenses, moderate-high NumTotalPurchases

DBSCAN¶

DBSCAN is a very powerful algorithm for finding high-density clusters, but the problem is determining the best set of hyperparameters to use with it. It includes two hyperparameters, eps, and min samples.

Since it is an unsupervised algorithm, you have no control over it, unlike a supervised learning algorithm, which allows you to test your algorithm on a validation set. The approach we can follow is basically trying out a bunch of different combinations of values and finding the silhouette score for each of them.

In [111]:
data_pca1.drop("dendrogram", axis=1, inplace=True)
In [112]:
# Initializing lists
eps_value = [2,3]                       # Taking random eps value
min_sample_values = [6,20]              # Taking random min_sample value

# Creating a dictionary for each of the values in eps_value with min_sample_values
res = {eps_value[i]: min_sample_values for i in range(len(eps_value))}  
In [113]:
from sklearn.cluster import DBSCAN


# Finding the silhouette_score for each of the combinations
high_silhouette_avg = 0                                               # Assigning 0 to the high_silhouette_avg variable
high_i_j = [0, 0]                                                     # Assigning 0's to the high_i_j list
key = res.keys()                                                      # Assigning dictionary keys to a variable called key
for i in key:
    z = res[i]                                                        # Assigning dictionary values of each i to z
    for j in z:
        db = DBSCAN(eps=i, min_samples=j).fit(data_pca1)               # Applying DBSCAN to each of the combination in dictionary
        core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
        core_samples_mask[db.core_sample_indices_] = True
        labels = db.labels_
        silhouette_avg = silhouette_score(data_pca1, labels)           # Finding silhouette score 
        print( 
            "For eps value =" + str(i),
            "For min sample =" + str(j),
            "The average silhoutte_score is :",
            silhouette_avg,                                          # Printing the silhouette score for each of the combinations
        )
        if high_silhouette_avg < silhouette_avg:                     # If the silhouette score is greater than 0 or the previous score, it will get appended to the high_silhouette_avg list with its combination of i and j              
            high_i_j[0] = i
            high_i_j[1] = j
            high_silhouette_avg = silhouette_avg
For eps value =2 For min sample =6 The average silhoutte_score is : 0.07631293391537533
For eps value =2 For min sample =20 The average silhoutte_score is : 0.3445089447463355
For eps value =3 For min sample =6 The average silhoutte_score is : 0.3307327349278307
For eps value =3 For min sample =20 The average silhoutte_score is : 0.34151752330706703
In [114]:
# Printing the highest silhouette score
print("Highest silhouette average is {:.2f} for eps = {} and min_samples = {}".format(high_silhouette_avg, high_i_j[0], high_i_j[1]))
Highest silhouette average is 0.34 for eps = 2 and min_samples = 20
In [115]:
# Append method and score to the results list
results.append({"Method": 'DBSCAN', "Score": high_silhouette_avg})
print(results)
[{'Method': 'Kmeans_3', 'Score': 0.29586573188903226}, {'Method': 'Kmeans_5', 'Score': 0.2894335595121166}, {'Method': 'Medioids_2', 'Score': 0.3127386542698528}, {'Method': 'Medioids_3', 'Score': 0.12521679142543043}, {'Method': 'Medioids_4', 'Score': 0.11421820199844311}, {'Method': 'Dendrogram', 'Score': 0.22837691444799138}, {'Method': 'DBSCAN', 'Score': 0.3445089447463355}]

Apply DBSCAN for the best hyperparameter and visualize the clusters from PCA¶

In [116]:
dbs = DBSCAN(eps=3, min_samples=20)
In [117]:
data_pca1["DBSCAN_cluster"] = kmeans.labels_                    # Adding DBSCAN cluster labels to the data_pca data

df["DBSCAN_cluster"] = kmeans.labels_                        # Adding DBSCAN cluster labels to the whole data

data_scaled["DBSCAN_cluster"] = kmeans.labels_                  # Adding DBSCAN cluster labels to data_scaled
In [118]:
PCA_PLOT(0, 1, data_pca1, "DBSCAN_cluster")

Observation:¶

Acceptable distribution with some overlap near the right side.

In [119]:
# Creating boxplot for each of the variables
# all_col = ['Income','Teenhome', 'Kidhome', 'Age', 'Family_Size', 'Expenses', 'NumTotalPurchases', 'Engaged_in_days', 'TotalAcceptedCmp', 'AmountPerPurchase']

plt.figure(figsize = (20, 50))

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=df[variable], x=df['DBSCAN_cluster'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()

Observations and Insights:

  1. Certain columns, such as the number of store purchases, expenses, total purchases, and amount per purchase, exhibit a distribution pattern similar to income.
  2. Conversely, there are columns like the number of web visits and family size that show an inverse distribution compared to income.
  3. The variation depicted by the boxplots is often unclear, as there is significant overlap between them.

Think about it:

  • Changing the eps and min sample values will result in different DBSCAN results? Can we try more value for eps and min_sample?

Characteristics of each cluster¶

Summary of each cluster:

Cluster 0:

  • Lowest income, low NumDealPurchases
  • High NumWeVisitsMonth, low NumWebPurchases
  • Lowest number of accepted campaigns
  • Lowest expenses, lowest NumTotalPurchases

Cluster 1:

  • Moderate income, low-moderate NumDealPurchases
  • Moderate NumWeVisitsMonth, moderate-high NumWebPurchases
  • Lowest number of accepted campaigns
  • moderate expenses, modereate NumTotalPurchases

Cluster 2:

  • Low-moderate income, moderate-high NumDealPurchases
  • High NumWeVisitsMonth, moderate NumWebPurchases
  • Low number of accepted campaigns
  • Low expenses, moderate NumTotalPurchases

Cluster 3:

  • Highest income, lowest NumDealPurchases
  • Low-moderate NumWeVisitsMonth, moderate NumWebPurchases
  • High number of accepted campaigns
  • Highest expenses, moderate NumTotalPurchases

Cluster 4:

  • High income, lowest NumDealPurchases
  • Low NumWeVisitsMonth, low-moderate NumWebPurchases
  • Low number of accepted campaigns
  • High expenses, moderate NumTotalPurchases

Gaussian Mixture Model¶

In [120]:
data_pca1 = data_pca1.drop('DBSCAN_cluster', axis=1)
In [121]:
# data_pca1 = data_pca1.drop('GMM_cluster', axis=1)
In [122]:
from sklearn.mixture import GaussianMixture

for n_components in range(3, 7):
    gmm = GaussianMixture(n_components=n_components, random_state=1)
    preds = gmm.fit_predict(data_pca1)
    score = silhouette_score(data_pca1, preds)
    
    print(f"Number of Components: {n_components}")
    print(f"Silhouette Score: {score}")
    
    results.append({"Method": f"GMM_{n_components}", "Score": score})
Number of Components: 3
Silhouette Score: 0.16256857703931887
Number of Components: 4
Silhouette Score: 0.15384146936782728
Number of Components: 5
Silhouette Score: 0.12086634023827815
Number of Components: 6
Silhouette Score: 0.06504147539555859
In [123]:
# Append method and score to the results list
results.append({"Method": 'GMM', "Score": score})
print(results)
[{'Method': 'Kmeans_3', 'Score': 0.29586573188903226}, {'Method': 'Kmeans_5', 'Score': 0.2894335595121166}, {'Method': 'Medioids_2', 'Score': 0.3127386542698528}, {'Method': 'Medioids_3', 'Score': 0.12521679142543043}, {'Method': 'Medioids_4', 'Score': 0.11421820199844311}, {'Method': 'Dendrogram', 'Score': 0.22837691444799138}, {'Method': 'DBSCAN', 'Score': 0.3445089447463355}, {'Method': 'GMM_3', 'Score': 0.16256857703931887}, {'Method': 'GMM_4', 'Score': 0.15384146936782728}, {'Method': 'GMM_5', 'Score': 0.12086634023827815}, {'Method': 'GMM_6', 'Score': 0.06504147539555859}, {'Method': 'GMM', 'Score': 0.06504147539555859}]

Observations and Insights:

  1. Cluster 4 has a relatively small decrease in the silhouette score compared to Cluster 3 and potentially improve the clustering results.
In [124]:
gmm = GaussianMixture(n_components=4, random_state=1)
preds = gmm.fit_predict(data_pca1)
In [125]:
data_pca1["GMM_cluster"] = preds                    # Adding GMM cluster labels to the data_pca data

df["GMM_cluster"] = preds                        # Adding GMM cluster labels to the whole data
# 
data_scaled["GMM_cluster"] = preds                  # Adding GMM cluster labels to data_scaled
In [126]:
df["GMM_cluster"].value_counts()
Out[126]:
0    750
2    601
1    443
3    414
Name: GMM_cluster, dtype: int64

Visualize the clusters using PCA¶

In [127]:
PCA_PLOT(0, 1, data_pca1, "GMM_cluster")
In [128]:
# Creating boxplot for each of the variables

plt.figure(figsize = (20, 50))

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=df[variable], x=df['GMM_cluster'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()

Observation: Clusters overlap within the box plots.

Cluster Profiling¶

Cluster 0:

  • Lowest income, low NumDealPurchases
  • Moderate NumWeVisitsMonth, low NumWebPurchases
  • Lowest number of accepted campaigns
  • Lowest expenses, lowest NumTotalPurchases

Cluster 1:

  • High income, lowest NumDealPurchases
  • Low NumWeVisitsMonth, moderate NumWebPurchases
  • High number of accepted campaigns
  • High expenses, moderate AmountPerPurchase

Cluster 2:

  • Moderate income, moderate NumDealPurchases
  • Moderate NumWeVisitsMonth, moderate NumWebPurchases
  • Lowest number of accepted campaigns
  • Low expenses, moderate NumTotalPurchases

Cluster 3:

  • High income, moderate NumDealPurchases
  • Moderate NumWeVisitsMonth, moderate NumWebPurchases
  • Low number of accepted campaigns
  • Moderate expenses, high NumTotalPurchases

Observations and Insights:

  1. Many clusters overlap for columns that measure customer behavior.

Characteristics of each cluster¶

Summary of each cluster:

Conclusion and Recommendations¶

Key Measures of success¶

  1. Are the segments generated distinct and meaningful, with clear differentiation from one another? A greater number of distinct groups in the boxplots indicates better differentiation.
    • Although the clusters may not be perfect in terms of being distinct and meaningful, they are still an improvement compared to relying solely on intuition for marketing decisions. It would be beneficial to evaluate the clusters with subject matter experts to gain further insights.
  2. Do the segments align with our marketing goals and objectives, and can we take specific marketing actions to target and engage each segment effectively?
    • Moving forward, marketing efforts would greatly benefit from considering this analysis. Conducting in-depth explorations of specific columns can help us develop stronger and more relatable marketing strategies.
  3. What is the size and reach of each segment, and do they have enough potential customers to justify targeted marketing efforts?
    • It is recommended to reallocate marketing efforts from Profile 0 to Profile 1 and 4. This shift is aimed at focusing on growing our revenue-driving customer base. More detailed information can be explored below.
  4. How homogeneous are the customers within each segment, and are there significant differences between segments?
    • The silhouette score provides a measure of how distinct the clusters are. However, analyzing box plots of these distinct methods can help verify if the clusters are meaningful and provide further insights into their homogeneity.
  5. At the end of the day, did it increase revenue, reduce cost, or a combination of both?
    • Tracking the impact on revenue and costs will be essential going forward. This analysis serves as a promising starting point, but it should be treated as an iterative process, continuously evaluated and adjusted as needed.

Evaluate Clustering Techniques¶

In [129]:
# Convert the results list to a DataFrame
results_df = pd.DataFrame(results)
styled_results = results_df.style.background_gradient(subset=["Score"], cmap="Greens")

# Print the results DataFrame
display(styled_results)
  Method Score
0 Kmeans_3 0.295866
1 Kmeans_5 0.289434
2 Medioids_2 0.312739
3 Medioids_3 0.125217
4 Medioids_4 0.114218
5 Dendrogram 0.228377
6 DBSCAN 0.344509
7 GMM_3 0.162569
8 GMM_4 0.153841
9 GMM_5 0.120866
10 GMM_6 0.065041
11 GMM 0.065041

Technique Evaluation:

  1. Based on the scores provided, the DBSCAN technique is performing the best with a score of 0.344509.
    • DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. It groups together data points that are close to each other and separates regions of lower density. It is particularly effective in identifying clusters of arbitrary shape and handling noise or outliers in the data.
      • The higher score indicates that the clusters formed by DBSCAN are more compact and well-separated compared to the other techniques. It suggests that DBSCAN was able to capture the underlying structure of the data accurately and generate meaningful clusters.
  2. Medoids with 2 clusters also has a relatively close silhouette score, but its analysis is limited due to having only 2 clusters, which reduces the quality of insights it can provide.
  3. Similarly, K-means with 3 clusters performed slightly better than K-means with 5 clusters. However, K-means with 5 clusters offers more comprehensive insights compared to K-means with 3 clusters, which provides more generalized insights.

Room to improve:

  1. Improving feature engineering techniques to enhance the clustering process.
  2. Dealing with outliers in the dataset more effectively.
  3. Consulting industry and company experts to gain valuable insights that can enhance the clustering process.
    1. Leveraging their expertise to identify alternative metrics that can improve the effectiveness of clustering.

Proposal for the Final Solution Design:¶

1.The DBSCAN model is recommended for adoption due to its highest silhouette score and ability to provide meaningful insights. The clusters it generates exhibit the best separation.

  • The adoption of the DBSCAN model solves the problem by accurately capturing the underlying structure of the marketing campaign data, identifying clusters of arbitrary shape, and effectively handling noise or outliers.
  • The DBSCAN model's performance is good enough for deployment in production. Its ability to generate meaningful clusters makes it a reliable choice for extracting insights from the data. Additionally, further evaluation and adjustments should be conducted in an iterative manner to enhance the model's performance and refine the analysis, considering the impact on revenue and costs.

Original questions:

  • What are the diffrent customer segments and how much do they spend on average?
  • What are the key features that describe each segment?
  • Which segments should we focus on? Are there any we should avoid?

What are the diffrent customer segments and how much do they spend on average?¶

In [130]:
prefered_model_stats = df.groupby('DBSCAN_cluster')['NumTotalPurchases'].count()

total_counts = prefered_model_stats.sum()
prefered_model_stats = prefered_model_stats.to_frame(name='CustomerCount')
prefered_model_stats['Percentage of Total Customers'] = prefered_model_stats['CustomerCount'] / total_counts * 100

formatted_prefered_model_stats = prefered_model_stats.round().astype(int)
formatted_prefered_model_stats["CustomerCount"] = formatted_prefered_model_stats["CustomerCount"]
formatted_prefered_model_stats["Percentage of Total Customers"] = formatted_prefered_model_stats["Percentage of Total Customers"]
styled_prefered_model_stats = formatted_prefered_model_stats.style.background_gradient(cmap="RdYlGn")
styled_prefered_model_stats.background_color = 'lightblue'

styled_prefered_model_stats.set_caption("Customer Count")
Out[130]:
Customer Count
  CustomerCount Percentage of Total Customers
DBSCAN_cluster    
0 969 44
1 401 18
2 347 16
3 159 7
4 332 15
In [131]:
df.groupby('DBSCAN_cluster')['Expenses', 'NumTotalPurchases'].mean()
Out[131]:
Expenses NumTotalPurchases
DBSCAN_cluster
0 87.796698 7.507740
1 867.887781 21.683292
2 518.135447 19.285303
3 1659.962264 21.301887
4 1404.900602 20.756024
In [132]:
prefered_model_stats = df.groupby('DBSCAN_cluster')['Expenses', 'NumTotalPurchases'].sum()
prefered_model_stats['AmountPerPurchase'] = prefered_model_stats['Expenses'] / prefered_model_stats['NumTotalPurchases']

formatted_prefered_model_stats = prefered_model_stats.round().astype(int)
formatted_prefered_model_stats = formatted_prefered_model_stats.rename(columns={"Expenses": "Revenue", "NumTotalPurchases": "Total Purchases"})
formatted_prefered_model_stats["AmountPerPurchase"] = formatted_prefered_model_stats["AmountPerPurchase"]

# Calculate the percent of total for each column
total_revenue = formatted_prefered_model_stats["Revenue"].sum()
total_purchases = formatted_prefered_model_stats["Total Purchases"].sum()
formatted_prefered_model_stats["PercentRevenue"] = (formatted_prefered_model_stats["Revenue"] / total_revenue * 100).round().astype(int)
formatted_prefered_model_stats["PercentPurchases"] = (formatted_prefered_model_stats["Total Purchases"] / total_purchases * 100).round().astype(int)

styled_prefered_model_stats = formatted_prefered_model_stats.style.background_gradient(cmap="RdYlGn")
styled_prefered_model_stats.background_color = 'lightblue'

styled_prefered_model_stats.set_caption("Customer Total Sum")
Out[132]:
Customer Total Sum
  Revenue Total Purchases AmountPerPurchase PercentRevenue PercentPurchases
DBSCAN_cluster          
0 85075 7275 12 6 22
1 348023 8695 40 26 26
2 179793 6692 27 13 20
3 263934 3387 78 20 10
4 466427 6891 68 35 21
In [133]:
prefered_model_stats_per_customer = df.groupby('DBSCAN_cluster')['Expenses', 'NumTotalPurchases'].mean()
prefered_model_stats_per_customer['AmountPerPurchase'] = prefered_model_stats['Expenses'] / prefered_model_stats['NumTotalPurchases']


formatted_prefered_model_stats_per_customer = prefered_model_stats_per_customer.round().astype(int)
formatted_prefered_model_stats_per_customer["Expenses"] = formatted_prefered_model_stats_per_customer["Expenses"]
formatted_prefered_model_stats_per_customer["NumTotalPurchases"] = formatted_prefered_model_stats_per_customer["NumTotalPurchases"]
styled_prefered_model_stats_per_customer = formatted_prefered_model_stats_per_customer.style.background_gradient(cmap="RdYlGn")
styled_prefered_model_stats_per_customer.background_color = 'lightblue'

styled_prefered_model_stats_per_customer.set_caption("Purchase Stats per Customer")
Out[133]:
Purchase Stats per Customer
  Expenses NumTotalPurchases AmountPerPurchase
DBSCAN_cluster      
0 88 8 12
1 868 22 40
2 518 19 27
3 1660 21 78
4 1405 21 68

In order to develop a better understanding of our ideal customers, it is important not to immediately jump to prescriptions based solely on data analysis. Instead, we should engage in conversations with the customers and cashiers to gain deeper insights. By actively talking to them, we can uncover the contextual nuances and obtain firsthand information about their preferences and behaviors. This approach will allow us to bridge the gap between the data and real-life experiences, enabling the development of a more informed and effective marketing strategy tailored to the needs and desires of our target audience.

3. Refined insights:

  • What are the most meaningful insights from the data relevant to the problem?

Customer profile 0: Represents the largest number of customers, making up 44% of the total customer base.

Customer profile 1: Represents the highest number of purchases, accounting for 26% of all purchases. The overall number of purchases appears to be more consistent compared to revenue.

Customer profile 2: Demonstrates moderate characteristics across various metrics.

Customer profile 3: Exhibits the highest average purchase amount but represents the smallest percentage of customers, only 7%. They make the fewest number of purchases.

Customer profile 4: Contributes the highest percentage of revenue, accounting for 35%. They have a moderate number of purchases and a significant average purchase amount..

  1. To enhance marketing effectiveness, reallocate marketing spend from customer profiles 0 and 1 to customer profiles 3 and 4.
    • Focus marketing on wines and meats. How can we more catalogs to similar people? Important note is they have small family size of 2. Number of web visits is low, we should find out why and learn how to proceed.

What are the key features that describe each segment?¶

Cluster 0:

  • Lowest income, low NumDealPurchases
  • Lowest number of accepted campaigns
  • Lowest expenses, lowest NumTotalPurchases

Cluster 1:

  • Lowest number of accepted campaigns

Cluster 2:

  • High NumWeVisitsMonth
  • Low number of accepted campaigns
  • Low expenses

Cluster 3:

  • Highest income, lowest NumDealPurchases
  • Highest expenses

Cluster 4:

  • High income, lowest NumDealPurchases
  • Low NumWeVisitsMonth
  • Low number of accepted campaigns
  • High expenses, moderate NumTotalPurchases

Which segments should we focus on? Are there any we should avoid?¶

We should redirect marketing efforts away from cluster 0 to cluster 1 or 4 which produce more revenue.

Executive Summary¶

The marketing campaign dataset 2,240 customers and tracks 27 features. After dropping columns with missing income, which accounted for 1% of the data, exploratory data analysis (EDA) was conducted. Some outliers were removed, followed by feature engineering, where values of columns were aggregated. PCA (Principal Component Analysis) was applied to help reduce the collinearity between variables and provide more meaningful clustering insights.

Multiple clustering models were built and evaluated using silhouette scores and insights from box plots. The DBSCAN technique outperformed others with a score of 0.34. DBSCAN is a density-based clustering algorithm that effectively identifies clusters of arbitrary shape and handles noise or outliers in the data. Compared to other techniques, its highest score indicates that it captured the underlying structure more effectively and produced meaningful clusters. Based on an analysis of the DBSCAN clusters, it is recommended to redirect marketing efforts from cluster 0 to clusters 1 or 4. These clusters have shown a greater potential for generating higher revenue. By reallocating marketing efforts towards clusters 1 and 4, businesses can target segments that have proven to be more lucrative. This strategic shift in focus will result in improved resource allocation and better marketing outcomes..

Improvement suggestions include enhancing feature engineering techniques, effectively handling outliers, and consulting industry and company experts for valuable insights to enhance the clustering process. Additionally, engaging in conversations with customers and cashiers is recommended to gain a better understanding of ideal customers. By bridging the gap between data analysis and real-life experiences, we can develop a more informed and effective marketing strategy that is tailored to the needs and desires of the target audience.

The final solution design proposal suggests adopting the DBSCAN model due to its highest silhouette score and ability to provide meaningful insights. The model captures the underlying structure, identifies clusters of arbitrary shape, and effectively handles noise or outliers. Further evaluation and adjustments should be conducted iteratively to enhance the model's performance and refine the analysis, considering the impact on revenue and costs.

Problem and Solution Summary¶

The problem at hand is to identify different segments within the existing customer base of a marketing campaign dataset. This involves analyzing the customers' spending patterns and past interactions with the company to gain insights into their behavior and preferences.

Reason for the proposed solution design: The proposed solution design suggests using the DBSCAN model. This choice is based on the model's high silhouette score, indicating its ability to capture meaningful clusters within the data. DBSCAN is known for its capability to identify clusters of arbitrary shape and effectively handle noise or outliers in the dataset.

Key Points of the Proposed Solution Design:

  1. Adoption of the DBSCAN model: The DBSCAN model is chosen based on its superior silhouette score, indicating its ability to capture the underlying structure of the customer base. It can identify clusters of varying shapes and handle noise or outliers effectively.
  2. Evaluation and adjustments: The proposed solution emphasizes the need for iterative evaluation and adjustments to enhance the model's performance and refine the analysis. This iterative process ensures that the model provides accurate and meaningful insights.
  3. Consideration of impact on revenue and costs: The solution design recognizes the importance of considering the impact of segment identification on both revenue and costs. By identifying different customer segments, the business can allocate resources more efficiently and tailor marketing strategies to target specific segments, potentially increasing revenue and minimizing costs.

Potential impact on the problem/business: The proposed solution design will provide valuable insights into the customer base and their segmentation. It will address key questions such as the different customer segments and their average spending, the key features that describe each segment, and which segments should be focused on or avoided. By understanding the customer segments and their behavior, businesses can tailor their marketing strategies, products, and services to better meet the needs and preferences of each segment. This targeted approach can lead to increased customer satisfaction, improved marketing campaign effectiveness, and potentially higher revenue. The solution's validity lies in its use of a reliable clustering algorithm, evaluation based on the Silhouette score, and the iterative refinement process to ensure the best possible results.

Recommendations for Implementation¶

Key Recommendations to Implement the Solution:

  • Data Preprocessing: Prioritize data preprocessing techniques such as handling missing values, outlier removal, and feature engineering to improve the quality and relevance of the dataset.
  • Model Implementation: Implement the DBSCAN algorithm as the clustering model of choice due to its ability to capture the underlying structure and handle noise or outliers effectively.
  • Evaluation and Refinement: Continuously evaluate the clustering results and refine the model iteratively.
  • Resource Reallocation: Based on the analysis of the DBSCAN clusters, redirect marketing efforts from the less promising clusters (e.g., cluster 0) to the more lucrative clusters (e.g., clusters 1 and 4). This reallocation can help optimize resource allocation and improve marketing outcomes.
  • Enhanced Feature Engineering: Explore advanced feature engineering techniques to further enhance the clustering process. Consider incorporating additional relevant data sources or creating new meaningful features that can provide deeper insights into customer behavior.

Key Actionables for Stakeholders:

  • Data Analysts: Conduct thorough data preprocessing, including handling missing values, outlier removal, and feature engineering. Implement the DBSCAN clustering algorithm and evaluate its performance using appropriate metrics.
  • Marketing Managers: Collaborate with data analysts to understand the cluster analysis results and identify key customer segments. Utilize these insights to reallocate marketing efforts and resources towards the more promising clusters for targeted marketing campaigns.
  • Business Executives: Review the analysis results and proposed reallocation strategy. Evaluate the potential impact on revenue and costs and provide necessary support for the implementation of the recommended changes.

Benefits of the Solution:

  • Enhanced Resource Allocation: By identifying distinct customer segments, businesses can allocate their marketing resources more efficiently, targeting the segments with higher revenue potential. This can lead to optimized resource utilization and improved marketing outcomes.
  • Tailored Marketing Strategies: Understanding customer segments allows businesses to tailor their marketing strategies, products, and services to the specific needs and preferences of each segment. This targeted approach can increase customer satisfaction and improves campaign effectiveness.
  • Revenue Growth: By focusing on the more lucrative customer segments, businesses have a higher probability of generating increased revenue. The analysis provides insights into the segments with the most potential, enabling businesses to prioritize their efforts and resources effectively.

Costs and Benefits Assumptions

  • Cost of Implementation: The implementation cost includes data preprocessing efforts, model development, and evaluation. It assumes a team of data analysts working for a specific period to complete these tasks.
  • Revenue Increase: By utilizing the statistics and business insights obtained from the DBSCAN clusters, we can perform a sensitivity analysis in order to develop an effective business strategy.
  • Cost of Reallocation: There might be some costs associated with reallocating marketing efforts, such as developing new marketing materials, conducting targeted campaigns, or modifying existing strategies. These costs should be considered when evaluating the benefits.

Key Risks and Challenges:

  • Data Quality: Poor data quality, such as missing values or inconsistent records, can affect the clustering results. Ensure rigorous data preprocessing and quality control to mitigate this risk.
  • Interpretation of Clusters: The interpretation of clusters requires domain knowledge and expertise. Incorrect interpretation or misunderstanding of clusters may lead to suboptimal marketing strategies and resource allocation.
  • Model Performance: The clustering model's performance depends on various factors, such as algorithm selection, parameter tuning, and the quality of features used. Continuously evaluate and refine the model to address any performance issues.

Further Analysis and Associated Problems:

  • Customer Profiling: Conduct further analysis to develop comprehensive customer profiles for each identified segment. This includes analyzing demographic information, preferences, and purchase behavior within each segment to gain a deeper understanding of their characteristics.
  • Customer Lifetime Value: Explore the calculation of customer lifetime value (CLV) for each segment to prioritize marketing efforts towards segments with higher CLV. This analysis helps identify segments that are most valuable in terms of long-term profitability.
  • Campaign Effectiveness Measurement: Develop a framework for measuring the effectiveness of marketing campaigns targeted at different segments. This includes defining relevant metrics, setting up tracking mechanisms, and conducting post-campaign analysis to assess the impact on revenue and customer engagement.
  • Competitor Analysis: Conduct a comparative analysis of the competitor's customer segmentation strategy to identify potential gaps or opportunities. Understanding how competitors target different segments can provide insights into effective strategies and differentiation possibilities.